Introduction

Background and purpose of this project

On March 19th, cherry blossoms in the city of Osaka, Japan saw the first blooming – the record earliest. Eleven other Japanese cities recorded its cherry blossoms’ earliest-ever first-blooming this year, and meteorologists say first-bloom dates are getting earlier in recent years in various locations in Japan.

This project aims to illustrate the overall trend of Japanese cherry blossoms’ first-blooming dates and to analyze some of the theories and workings behind the first-bloom dates.

Exploratory data analysis questions

In this project, I’m going to analyze the following questions:

What is an overall trend of Japanese cherry blossom’s first-bloom dates over time? To confirm whether the trend pointed out by meteorologists is true, I’m going to analyze the average first-bloom dates in various locations in Japan over time.

Is 400-degree theory true? What about 600-degree theory? There are several theories about cherry blossom’s first-blooming dates. One of them is the 400-degree theory, which claims the first-blooming happens when the cumulative daily average temperature since February 1st reaches 400 degrees (Celsius). Another is the 600-degree theory. This claims the first-blooming occurs when the cumulative daily high temperature since February 1st reaches 600 degrees (Celsius). I’m going to analyze to see if these theories are correct.

Data

I’m using a data set about cherry blossom’s first-bloom dates since 1953 in dozens of locations in Japan, published by Japan Meteorological Agency. I downloaded a cleaned version posted on Kaggle.com.

This data set includes cherry blossom’s first-blooming dates in more than 100 locations in Japan where the first-blooming were ever observed by Japan Meteorological Agency. For this project, I’m going to focus on 58 locations where they currently conduct observations.

By matching those locations in the data set and their addresses available on the agency’s list of observatories one by one, I confirmed those locations are distributed fairly equally in geographical locations in Japan.

I’m also using data about daily high and average temperature in six major locations in Japan to analyze the third question. Those data sets were directly downloaded from Japan Meteorological Agency, by using the agency’s downloading system.

knitr::opts_chunk$set(echo = TRUE)

# load libraries I need
library(tidyverse)
library(fs)
library(lubridate)
library(moments)
# load cherry blossom's first-blooming data, downloaded from Kaggle
sakura_raw <- read_csv("data/sakura_first_bloom_dates.csv")

# tidy data
sakura_gathered <- sakura_raw %>% 
  select(-`30 Year Average 1981-2010`, -Notes) %>% 
  gather(key = "year", value = "date", "1953":"2020") %>% 
  rename(location = `Site Name`,
         now_observed = `Currently Being Observed`)

# filter out locations currently not observed, calculate date of the year
sakura <- sakura_gathered %>% 
  filter(now_observed == TRUE) %>% 
  mutate(
    date_x = ymd(date),
    floor_date = as.Date(paste(year, "-01-01", sep = ""), format = "%Y-%m-%d"),
    doy_x = interval(floor_date, date_x) %>% time_length(unit = "day")) %>% 
  select(-floor_date, -now_observed, -date)

head(sakura)
## # A tibble: 6 x 4
##   location  year  date_x     doy_x
##   <chr>     <chr> <date>     <dbl>
## 1 Wakkanai  1953  1953-05-21   140
## 2 Asahikawa 1953  1953-05-11   130
## 3 Abashiri  1953  1953-05-24   143
## 4 Sapporo   1953  1953-05-07   126
## 5 Obihiro   1953  1953-05-15   134
## 6 Kushiro   1953  NA            NA

The cleaned data set has 3944 rows and 4 columns. Each row shows an observation, and columns are as follows:

  • location: the location the cherry blossom’s blooming was observed
  • year: the year of observation
  • date: date of first-blooming (Year-Month-Day)
  • doy: date of first-blooming, converted as the number of days after January 1st

I’m computing the dates as number of days after January 1st. About 60 days after January 1st indicates March 1st, about 90 indicates April 1st, and about 120 is May 1st. Negative numbers mean cherry blossoms saw the first-bloom before January 1st that year; in all cases found in this project, they were in late December the year before. I still count those cases as first-blooming in the coming spring (thus negative number of days after January 1st).

For all of the tests in this project, I’m going to apply the alpha value of 0.05.

Qestion 1: Overall trend of first-blooming dates over time

In this section, I’m going to compute the yearly average first-blooming date of all 58 locations from 1953 to 2020.

Plotting all the observed dates and yearly average

First, in order to see the general trend over years, I’m going to plot all the observed first-bloom dates and their yearly average.

# plot all observed first-blooming dates 
sakura %>% 
  group_by(year) %>% 
  ggplot(aes(x = as.numeric(year), y = doy_x)) +
  geom_point(color = "#e7298a", alpha = 0.1) +
  geom_smooth(color = "#000000") +
  scale_x_continuous(breaks = c(1955, 1960, 1965, 1970, 1975, 1980, 1985, 1990, 
                                1995, 2000, 2005, 2010, 2015, 2020)) +
  labs(title = "Distribution of cherry blossom's first-bloom days and its average",
       subtitle = "From 1953 to 2020 in 58 locations in Japan",
       x = "Year", y = "Number of days after January 1st") +
  theme_minimal()

In the visualization above, the circles indicate the number of days after January 1st when cherry blossoms saw the first blooming. Overall, the circles are overlapping each other around 75 to 100 days after the new year, meaning mid-March to early-April.

The black line on top of the circles indicates a smoothed regression line. As shown above, cherry blossoms are blooming earlier over time.

Analyzing the distribution of dates in 2020

Next, I’m going to look deeper into the distribution of first-blooming dates in a particular year. As an example, I’m analyzing the data from the year 2020.

I’m firstly going to compute the measures of central tendency and variability.

# filter data about the year 2020
sakura.2020 <- sakura %>% 
  filter(year == "2020")

# compute central tendency and variability
mean(sakura.2020$doy_x) # Mean: 84.93103
## [1] 84.93103
median(sakura.2020$doy_x) # Median: 84
## [1] 84
sd(sakura.2020$doy_x) # Standard deviation: 23.77516
## [1] 23.77516
max(sakura.2020$doy_x) - min(sakura.2020$doy_x) # Range: 125
## [1] 125

As a result of the computation above, the average first-blooming date in 2020 is about 85 days after January 1st, which is March 26th. The median is 84, meaning March 25th. The standard deviation of the distribution of this data frame is 23.7, and the first-blooming dates ranged 125 days in 2020; the earliest blooming was in Naha, Okinawa on January 6th, while the latest was observed in Wakkanai and Kushiro, both in Hokkaido, on May 10th.

Next, I’m going to plot the distribution and test if it is normally distributed.

# plot distribution
sakura.2020 %>% 
  ggplot(aes(x = doy_x)) +
  geom_histogram(binwidth = 5, fill = "#e7298a", color = "white", alpha = 0.8) +
  labs(title = "Distribution of cherry blossom's first-blooming dates in 2020",
       subtitle = "In 58 locations in Japan",
       x = "Number of days after January 1st", y = "Count") +
  theme_minimal()

From this visualization, I can say this distribution has a single hump and that it looks quite leptokurtic. From the calculation above, mean is slightly larger than the median, suggesting a slight positive skew. However, because of some outliers close to the lower end, the distribution may also look slightly negatively skewed. In any ways, there are so many missing points, especially in the lower half of the distribution, that I can’t quite say if it is skewed in which way.

In order to learn more about this distribution, I’m going to compute its skewness and kurtosis, as well as conduct the Shapiro–Wilk test to test its normality.

# calculate skewness and kurtosis
skewness(sakura.2020$doy_x) # -0.883243
## [1] -0.883243
kurtosis(sakura.2020$doy_x) # 5.473665
## [1] 5.473665
# conduct the shapiro-wilk test
shapiro.test(sakura.2020$doy_x) # p-value = 3.987e-07 ... not normal
## 
##  Shapiro-Wilk normality test
## 
## data:  sakura.2020$doy_x
## W = 0.8125, p-value = 3.987e-07

First, the skewness is -0.88, indicating a slight negative skew. The kurtosis is 5.47, and it is strongly leptokurtic.

In the Shapiro-Wilk test, the hypotheses are:

  • Null hypothesis: the distribution of number of dates is normally distributed
  • Alternative hypothesis: the distribution of number of dates is not normally distributed

As a result of the test, the p-value is extremely small. Since the alpha level is 0.05, the p-value is below the alpha level. Thus, I reject the null hypothesis and accept the alternative hypothesis; the distribution is not normal.

Analysis of 400-degree theory and 600-degree theory

In this section, I’m going to analyze the data set to test if the 400-degree theory and 600-degree theory are true.

For each location, I obtained the daily average and high temperature from 1953 to 2020. I’m going to identify the date when both theories claim to be the first-blooming date, and compare the date to the actual blooming date for that location every year. I’m then conducting the one-sample t-test to see if the difference between expected first-blooming date and actual date is statistically significant.

Load all the temperature data

file_paths <- dir_ls("data new/daily-temp")
file_contents <- list()
file_names <- list()

# load temperature data for 58 observation locations
for(i in 1:58) {
  file_names[[i]] <- str_sub(file_paths[[i]], start = 21, end = -5)
  file_contents[[i]] <- read_csv(file_paths[[i]])
}
file_contents <- set_names(file_contents, file_names)

# identify expected first-blooming date based on 400- and 600- theories
theory400 <- list()
theory600 <- list()

for(i in 1:58) {
  file_contents[[i]] <- file_contents[[i]] %>% 
    mutate(date = ymd(date),
           avg_temp = as.numeric(avg_temp),
           high_temp = as.numeric(high_temp),
           year = year(date)) %>% 
  group_by(year) %>% 
  mutate(cum_high_temp = cumsum(high_temp),
         cum_avg_temp = cumsum(avg_temp))
  
  theory400[[i]] <- file_contents[[i]]
  theory600[[i]] <- file_contents[[i]]
  
  theory400[[i]] <- theory400[[i]] %>% 
  filter(cum_avg_temp > 400) %>% 
  filter(cum_avg_temp == min(cum_avg_temp)) %>% 
  mutate(
    date_mu = date,
    floor_date = floor_date(date, unit = "year"),
    doy_mu = interval(floor_date, date_mu) %>% time_length(unit = "day")
  ) %>% 
  select(date_mu, doy_mu, year)
  
  theory600[[i]] <- theory600[[i]] %>% 
  filter(cum_high_temp > 600) %>% 
  filter(cum_high_temp == min(cum_high_temp)) %>% 
  mutate(
    date_mu = date,
    floor_date = floor_date(date, unit = "year"),
    doy_mu = interval(floor_date, date_mu) %>% time_length(unit = "day")
  ) %>% 
  select(date_mu, doy_mu, year)
}

# calculate difference between expected and actual first bloom dates
for(i in 1:58) {
  theory400[[i]] <- sakura %>% 
    filter(location == file_names[[i]]) %>% 
    merge(theory400[[i]]) %>% 
    mutate(diff = doy_x - doy_mu)
  
  theory600[[i]] <- sakura %>% 
    filter(location == file_names[[i]]) %>% 
    merge(theory600[[i]]) %>% 
    mutate(diff = doy_x - doy_mu)  
}

Now, I’m going to conduct the one-sample t-test for each distribution and the difference between dates are statistically significant.

In this test, the hypotheses are:

# conduct one-sample t-test for all locations
theory400_pVal <- NULL
theory400_list <- NULL
theory600_pVal <- NULL
theory600_list <- NULL

for (i in 1:58) {
  theory400_pVal <- rbind(theory400_pVal,
                          t.test(x = theory400[[i]]$diff, mu = 0)$p.value)
  theory400_list[[i]] <- list(location = file_names[[i]], pVal = theory400_pVal[[i]])
  theory600_pVal <- rbind(theory600_pVal,
                          t.test(x = theory600[[i]]$diff, mu = 0)$p.value)
  theory600_list[[i]] <- list(location = file_names[[i]], pVal = theory600_pVal[[i]])
}

Visualize p-values on the map

# load observatory locations data
observatory_locations <- read_csv('data new/observatory-locations.csv')
## Parsed with column specification:
## cols(
##   location = col_character(),
##   lat = col_double(),
##   long = col_double()
## )
# bind p-value data with observatory locations data
theory400_location <- as.data.frame(do.call(rbind, theory400_list)) %>% 
  mutate(is_above_alpha = ifelse(pVal >= 0.05, 1, 0))
theory400_merged <- cbind(theory400_location, observatory_locations) %>% 
  select(-4)

theory600_location <- as.data.frame(do.call(rbind, theory600_list)) %>% 
  mutate(is_above_alpha = ifelse(pVal >= 0.05, 1, 0))
theory600_merged <- cbind(theory600_location, observatory_locations) %>% 
  select(-4)

For this visualization, I’m using a custom mapping tool, Mapbox.

I created the style for myself on Mapbox Studio. You can preview the style here: 400 theory: https://api.mapbox.com/styles/v1/yuriko-schumacher/ckntdx44900zd17ladmwyyekn.html?fresh=true&title=view&access_token=pk.eyJ1IjoieXVyaWtvLXNjaHVtYWNoZXIiLCJhIjoiY2ttNDVoemgyMDFjcDJxdXM5cWx5d3FzdiJ9.Ajc4ZM1IbKLbRPSkrBJNrA 600 theory: https://api.mapbox.com/styles/v1/yuriko-schumacher/ckntjiflk05c118pijm44efkg.html?fresh=true&title=view&access_token=pk.eyJ1IjoieXVyaWtvLXNjaHVtYWNoZXIiLCJhIjoiY2ttNDVoemgyMDFjcDJxdXM5cWx5d3FzdiJ9.Ajc4ZM1IbKLbRPSkrBJNrA#4.05/34.77/141.49

# visualize data about p-values based on 400-degree theory on the map 
library(mapdeck)
## 
## Attaching package: 'mapdeck'
## The following object is masked from 'package:tibble':
## 
##     add_column
key <- "pk.eyJ1IjoieXVyaWtvLXNjaHVtYWNoZXIiLCJhIjoiY2ttNDVoemgyMDFjcDJxdXM5cWx5d3FzdiJ9.Ajc4ZM1IbKLbRPSkrBJNrA"
l1 <- legend_element(
        variables = c("above 0.05","below 0.05")
        , colours = c("#FF0080","#ccc")
        , colour_type = "stroke"
        , variable_type = "category"
        , title = "p-value"
)

mapdeck(token = key, style = "mapbox://styles/yuriko-schumacher/ckntdx44900zd17ladmwyyekn",
        location =  c(140.401403, 35.443116),
        zoom = 5) %>% 
  add_scatterplot(
    data = theory400_merged
    , fill_opacity = 0
    , legend = mapdeck_legend(l1)
  )
## Registered S3 method overwritten by 'jsonify':
##   method     from    
##   print.json jsonlite

As shown above, there are five locations (Takamatsu, Tokyo, Tsu, and Yokohama) where 400-degree theory hold true.

# visualize data about p-values based on 600-degree theory on the map 
mapdeck(token = key, style = "mapbox://styles/yuriko-schumacher/ckntjiflk05c118pijm44efkg",
        location =  c(140.401403, 35.443116),
        zoom = 3.5) %>% 
  add_scatterplot(
    data = theory600_merged
    , fill_opacity = 0
    , legend = mapdeck_legend(l1)
  )

There are three locations (Kanazawa, Sendai, and Toyama) where 600-degree theory hold true.

Summary of all six locations

Finally, I’m going to visualize the distribution of difference between the actual first-blooming date and expected first-blooming date in both theories in all 58 locations.

I plot difference between expected first-blooming date (based on theories) and actual date on the x-axis, and latitude on the y-axis (higher on the y-axis means north of Japan). I color locations with p-values above alpha threshold as pink.

theory400_unlisted <- do.call(rbind.data.frame, theory400) %>% 
    mutate(is_above_alpha = ifelse(location %in% c("Takamatsu", "Tokyo", "Tsu", "Yokohama"), TRUE, FALSE)) %>% 
    mutate(location = fct_relevel(location, "Wakkanai", "Abashiri", "Asahikawa", "Sapporo", "Kushiro", "Obihiro", "Muroran", "Hakodate", "Aomori", "Akita", "Morioka", "Sendai", "Yamagata", "Niigata", "Fukushima", "Toyama", "Nagano", "Kanazawa", "Utsunomiya", "Maebashi", "Mito", "Kumagaya", "Fukui", "Tokyo", "Choshi", "Kofu", "Tottori", "Matsue", "Yokohama", "Gifu", "Hikone", "Nagoya", "Kyoto", "Shizuoka", "Tsu", "Kobe", "Osaka", "Nara", "Okayama", "Hiroshima", "Takamatsu", "Wakayama", "Tokushima", "Shimonoseki", "Matsuyama", "Fukuoka", "Kochi", "Saga", "Oita", "Kumamoto", "Nagasaki", "Miyazaki", "Kagoshima", "Naze", "Naha", "Minamidaitojima", "Miyakojima", "Ishigaki Island"))


theory400_unlisted %>% 
  ggplot(aes(x = diff, y = desc(location)), size = 0.1) +
  geom_vline(xintercept = 0) +
  geom_point(aes(color = is_above_alpha), alpha = 0.3) + 
  scale_color_manual(values = c("#cccccc", "#FF0080")) +
    labs(title = "Differences between the actual date and expected date on 400 degrees theory",
       x = "Difference in days", y = "Latitude") +
  theme_minimal() +
  theme(axis.text.y = element_blank()) 
## Warning: Removed 128 rows containing missing values (geom_point).

theory600_unlisted <- do.call(rbind.data.frame, theory600) %>% 
    mutate(is_above_alpha = ifelse(location %in% c("Kanazawa", "Sendai", "Toyama"), TRUE, FALSE)) %>% 
    mutate(location = fct_relevel(location, "Wakkanai", "Abashiri", "Asahikawa", "Sapporo", "Kushiro", "Obihiro", "Muroran", "Hakodate", "Aomori", "Akita", "Morioka", "Sendai", "Yamagata", "Niigata", "Fukushima", "Toyama", "Nagano", "Kanazawa", "Utsunomiya", "Maebashi", "Mito", "Kumagaya", "Fukui", "Tokyo", "Choshi", "Kofu", "Tottori", "Matsue", "Yokohama", "Gifu", "Hikone", "Nagoya", "Kyoto", "Shizuoka", "Tsu", "Kobe", "Osaka", "Nara", "Okayama", "Hiroshima", "Takamatsu", "Wakayama", "Tokushima", "Shimonoseki", "Matsuyama", "Fukuoka", "Kochi", "Saga", "Oita", "Kumamoto", "Nagasaki", "Miyazaki", "Kagoshima", "Naze", "Naha", "Minamidaitojima", "Miyakojima", "Ishigaki Island"))


theory600_unlisted %>% 
  ggplot(aes(x = diff, y = desc(location)), size = 0.1) +
  geom_vline(xintercept = 0) +
  geom_point(aes(color = is_above_alpha), alpha = 0.3) + 
  scale_color_manual(values = c("#cccccc", "#FF0080")) +
    labs(title = "Differences between the actual date and expected date on 600 degrees theory",
       x = "Difference in days", y = "Latitude") +
  theme_minimal() +
  theme(axis.text.y = element_blank()) 
## Warning: Removed 128 rows containing missing values (geom_point).

These visualizations suggest the difference between expected and actual first blooming dates have something to do with latitude.

Based on the t-test in the previous section, 400-degree theory only holds true in four locations, while 600-degree theory only holds true in three locations.

However, from the visualizations above, also in other locations, the distribution of difference in days based on those theories are concentrated closer to 0 days. In those places, the expected date calculated out of daily temperatures could be one of the indicators to consider cherry blossom’s first-blooming date.

Summary

Findings

Here are the major findings from exploratory data analyses conducted above.

  1. Overall, cherry blossom’s first-blooming date are getting earlier and earlier in Japan.

  2. Out of six major cities in Japan (Sapporo, Sendai, Tokyo, Osaka, Fukuoka, and Naha), the 400-degree theory only holds true in Tokyo, and the 600-degree theory only holds true in Sendai. However, in cities like Osaka and Fukuoka, the first-blooming dates identified by those theories could be considered as one of the indicators to expect the first-blooming dates.

Further analysis

  1. For the general trend of first-blooming date, it would be better to analyze if the margin decrease of number of dates until first-blooming is statistically significant.

  2. For the third question, I would be curious to see what other climate data can be used to indicate cherry blossom’s first-blooming date.